Import the required libraries. Before running the notebook, it is assumed that the user has already installed the libraries listed in requirements.txt.
import os
import datetime as dt
import folium
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import numpy as np
import pandas as pd
import pylab
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from tqdm.notebook import tqdm
from IPython.display import HTML
# Modify to adjust figure sizes
pylab.rcParams['figure.figsize'] = (12, 4)
Since the notebooks and utility functions are under the src folder of the repository, os.chdir("..") is used as an easy way to move up one directory and access the other folders. This serves as a workaround for the initial exploration.
os.chdir("..")
from utils.config import filenames
from utils import datautils
Modify the DATA_LOCATION directory as needed. This configuration sets it to src/data in the repository and assumes that the required data is already located within that directory, e.g.
us-traffic
│ ...
│
└───src
| | datasetdownloader.py
│ │ ...
│ │
│ └───data <- (DATA_LOCATION)
| | dot_traffic_2015.txt.gz
| | dot_traffic_stations_2015.txt.gz
│ |
| └─── ...
|
...
Alternatively, the user can also download the files via the datasetdownloader.py script under src.
DATA_LOCATION = os.path.join(os.getcwd(), 'data')
There are two main files under the dataset:
traffic_data, traffic_stations = datautils.load_traffic_datasets(
DATA_LOCATION,
filenames["TRAFFIC_DATA"],
filenames["TRAFFIC_STATIONS"])
Loading traffic data from 'dot_traffic_2015.txt.gz' ...
Loading traffic stations from 'dot_traffic_stations_2015.txt.gz' ...
Finished loading data.
Checking the initial entries of both dataframes, we can see that traffic_data mainly contains historical traffic volume information, while traffic_stations describes the stations that collected that traffic volume information.
traffic_data.head()
| date | day_of_data | day_of_week | direction_of_travel | direction_of_travel_name | fips_state_code | functional_classification | functional_classification_name | lane_of_travel | month_of_data | ... | traffic_volume_counted_after_1500_to_1600 | traffic_volume_counted_after_1600_to_1700 | traffic_volume_counted_after_1700_to_1800 | traffic_volume_counted_after_1800_to_1900 | traffic_volume_counted_after_1900_to_2000 | traffic_volume_counted_after_2000_to_2100 | traffic_volume_counted_after_2100_to_2200 | traffic_volume_counted_after_2200_to_2300 | traffic_volume_counted_after_2300_to_2400 | year_of_data | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-04-07 | 7 | 3 | 1 | North | 56 | 3R | Rural: Principal Arterial - Other | 1 | 4 | ... | 89 | 122 | 124 | 110 | 69 | 73 | 28 | 12 | 6 | 15 |
| 1 | 2015-09-26 | 26 | 7 | 7 | West | 21 | 1U | Urban: Principal Arterial - Interstate | 2 | 9 | ... | 1617 | 1669 | 1308 | 1068 | 928 | 885 | 798 | 650 | 613 | 15 |
| 2 | 2015-06-16 | 16 | 3 | 3 | East | 6 | 1U | Urban: Principal Arterial - Interstate | 0 | 6 | ... | 4244 | 4405 | 4609 | 4361 | 3272 | 2243 | 2050 | 1453 | 892 | 15 |
| 3 | 2015-04-26 | 26 | 1 | 5 | South | 55 | 1U | Urban: Principal Arterial - Interstate | 1 | 4 | ... | 1011 | 959 | 851 | 708 | 559 | 457 | 297 | 207 | 110 | 15 |
| 4 | 2015-05-23 | 23 | 7 | 3 | East | 4 | 4R | Rural: Minor Arterial | 0 | 5 | ... | 83 | 61 | 55 | 35 | 29 | 21 | 23 | 9 | 7 | 15 |
5 rows × 38 columns
print("traffic_data row count:", traffic_data.shape[0])
print("traffic_data column count:", traffic_data.shape[1])
traffic_data row count: 7140391
traffic_data column count: 38
traffic_stations.head()
| algorithm_of_vehicle_classification | algorithm_of_vehicle_classification_name | calibration_of_weighing_system | calibration_of_weighing_system_name | classification_system_for_vehicle_classification | concurrent_route_signing | concurrent_signed_route_number | direction_of_travel | direction_of_travel_name | fips_county_code | ... | sample_type_for_vehicle_classification_name | second_type_of_sensor | shrp_site_identification | station_id | station_location | type_of_sensor | type_of_sensor_name | year_of_data | year_station_discontinued | year_station_established | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | 13 | 3 | 091 | 7 | West | 59 | ... | Station not used for Heavy Vehicle Travel Info... | N | NaN | 129130 | LAKEVIEW AVENUE ORA91R10.091 | L | Inductance loop | 15 | 0 | 97 |
| 1 | NaN | NaN | NaN | NaN | 13 | 3 | 099 | 5 | South | 77 | ... | Station not used for Heavy Vehicle Travel Info... | N | NaN | 100190 | LITTLE JOHN CREEK SJ9912.526 | L | Inductance loop | 15 | 0 | 97 |
| 2 | G | Axle spacing with Scheme F modified | NaN | NaN | 15 | 1 | 005 | 1 | North | 93 | ... | Station used for Heavy Vehicle Travel Informat... | N | NaN | 022940 | EDGEWOOD SIS5R22.999 | P | Piezoelectric | 15 | 0 | 69 |
| 3 | D | Vehicle length classification | M | Moving average of the steering axle of 3S2s | 13 | 0 | NaN | 5 | South | 35 | ... | Station not used for Heavy Vehicle Travel Info... | NaN | NaN | 000302 | I 15 12900 South M.P. 290.6 | X | Radio wave | 15 | 0 | 11 |
| 4 | G | Axle spacing with Scheme F modified | 0 | NaN | 14 | 1 | 000000 | 7 | West | 27 | ... | Station not used for Heavy Vehicle Travel Info... | NaN | NaN | W01136 | E. of Franklin Rd Underpass | L | Inductance loop | 15 | 0 | 95 |
5 rows × 55 columns
print("traffic_stations row count:", traffic_stations.shape[0])
print("traffic_stations column count:", traffic_stations.shape[1])
traffic_stations row count: 28466
traffic_stations column count: 55
By checking the columns, we can see that there are some common columns for both traffic_data and traffic_stations. We can use the common columns to match relevant features (e.g. spatial information) from traffic_stations to traffic_data entries when we do preprocessing for model input later on.
common_cols = set(traffic_data.columns) & set(traffic_stations.columns)
common_cols
{'direction_of_travel',
'direction_of_travel_name',
'fips_state_code',
'functional_classification',
'functional_classification_name',
'lane_of_travel',
'record_type',
'station_id',
'year_of_data'}
While it's possible to match the rows through station_id alone, doing so relies on the assumption that there is a one-to-one correspondence between a specific location (set of coordinates) and a single station_id.
This assumption is shown to be false below.
sample_station = "000302"
spatial_columns = ["fips_state_code",
"fips_county_code",
"latitude", "longitude",
"functional_classification_name"]
sub_df = traffic_stations[traffic_stations["station_id"] == sample_station]
sub_df = sub_df[spatial_columns + ["station_id"]].drop_duplicates()
display(sub_df)
| fips_state_code | fips_county_code | latitude | longitude | functional_classification_name | station_id | |
|---|---|---|---|---|---|---|
| 3 | 49 | 35 | 40.516500 | 111.891520 | Urban: Principal Arterial - Interstate | 000302 |
| 7540 | 8 | 35 | 39.430691 | 104.966383 | Rural: Major Collector | 000302 |
| 8245 | 19 | 47 | 41.910000 | 95.228300 | Rural: Major Collector | 000302 |
| 14812 | 48 | 29 | 29.299390 | 98.628500 | Urban: Principal Arterial - Interstate | 000302 |
| 18605 | 13 | 51 | 32.058600 | 81.022470 | Urban: Principal Arterial - Other | 000302 |
As seen in the sub-dataframe above, the station_id "000302" is not unique to a specific location: it appears across different longitudes, latitudes, and FIPS codes.
Therefore, to get the correct spatial features for a specific row in traffic_data, we need to make use of the other common columns for matching.
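A minimal sketch of how spatial features could be attached by merging on several of the common columns. The tiny frames below are illustrative stand-ins for traffic_data and traffic_stations; only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-ins: the same station_id maps to two different states,
# so station_id alone would be ambiguous.
data = pd.DataFrame({
    "station_id": ["000302", "000302"],
    "fips_state_code": [49, 8],
    "direction_of_travel": [5, 5],
})
stations = pd.DataFrame({
    "station_id": ["000302", "000302"],
    "fips_state_code": [49, 8],
    "direction_of_travel": [5, 5],
    "latitude": [40.516500, 39.430691],
    "longitude": [111.891520, 104.966383],
})

# Match on a combination of the shared columns, not just station_id.
merge_keys = ["station_id", "fips_state_code", "direction_of_travel"]
merged = data.merge(stations, on=merge_keys, how="left")
print(merged[["station_id", "fips_state_code", "latitude"]])
```

Using a left merge keeps every traffic_data row, with coordinates filled in only where a station match exists.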
Since multiple columns in both traffic_data and traffic_stations come in pairs of a numerical encoding and a named equivalent carrying the same information (e.g. direction_of_travel and direction_of_travel_name), we can check the assignment and validate whether each numerical code is unique to each value.
def check_correspondence(df, col1_name, col2_name):
    # For each unique value in col1_name, print the unique values of
    # col2_name that co-occur with it.
    for var1 in df[col1_name].unique():
        var2 = df[df[col1_name] == var1][col2_name].unique()
        print(var1, var2)
From the output, it can be seen that there are no repeated codes except under "Other lanes" in lane_of_travel_name. This is still valid, since those multiple codes appear only under "Other lanes" and not under any other lane name.
Therefore, we can opt to just retrieve the columns ending in "_name" and perform encoding before modeling.
name_cols = []
for col in set(traffic_data.columns) | set(traffic_stations.columns):
if "name" in col:
name_cols.append(col)
for col in name_cols:
print(col, ":")
check_correspondence(traffic_stations,
col,
col[:-5])
print("--")
method_of_vehicle_classification_name :
nan []
Permanent vehicle classification device [3]
Portable vehicle classification device [2]
Human observation (manual) vehicle classification [1]
--
direction_of_travel_name :
West [7]
South [5]
North [1]
East [3]
North-South or Northeast-Southwest combined (ATR stations only) [9]
East-West or Southeast-Northwest combined (ATR stations only) [0]
Northeast [2]
Southwest [6]
Northwest [8]
Southeast [4]
--
functional_classification_name :
Urban: Principal Arterial - Other Freeways or Expressways ['2U']
Rural: Principal Arterial - Other ['3R']
Rural: Principal Arterial - Interstate ['1R']
Urban: Principal Arterial - Interstate ['1U']
Urban: Principal Arterial - Other ['3U']
Rural: Minor Arterial ['4R']
Urban: Minor Arterial ['4U']
Urban: Collector ['5U']
Rural: Major Collector ['5R']
Rural: Minor Collector ['6R']
Urban: Local System ['7U']
Rural: Local System ['7R']
--
type_of_sensor_name :
Inductance loop ['L']
Piezoelectric ['P']
Radio wave ['X']
Quartz piezoelectric - NEW ['Q']
Microwave ['W']
nan []
Road tube ['R']
Human observation (manual) ['H']
Bending plate ['B']
Ultrasonic ['U']
Other ['Z']
Sonic/acoustic ['S']
Infrared ['I']
Strain gauge on bridge beam ['G']
Video image ['V']
Hydraulic load cells ['E']
Automatic vehicle identification (AVI) ['A']
Laser/lidar ['K']
Magnetometer ['M']
Fiber optic - NEW ['F']
--
calibration_of_weighing_system_name :
nan []
Moving average of the steering axle of 3S2s ['M']
Other method ['Z']
Test trucks only ['T']
Combination of test trucks and trucks from the traffic stream (but not ASTM E1318) ['C']
ASTM Standard E1318 ['A']
Uncalibrated ['U']
Other sample of trucks from the traffic stream ['D']
Subset of ASTM Standard E1318 ['B']
Static calibration ['S']
--
method_of_traffic_volume_counting_name :
Permanent automatic traffic recorder (ATR) [3]
Portable traffic recording device [2]
nan []
Human observation (manual) [1]
--
primary_purpose_name :
Planning or traffic statistics purposes ['P']
Research purposes (e.g. LTPP) ['R']
Load data for pavement design or pavement management purposes ['L']
Operations purposes but not ITS ['O']
nan []
Operations purposes in support of ITS initiatives ['I']
Enforcement purposes (e.g. speed or weight enforcement) ['E']
--
algorithm_of_vehicle_classification_name :
nan []
Axle spacing with Scheme F modified ['G']
Vehicle length classification ['D']
Axle spacing with Scheme F ['F']
Axle spacing and other input(s) not specified above ['N']
Axle spacing and weight algorithm ['K']
Axle spacing and vehicle length algorithm ['L']
Other means not specified above ['Z']
Axle spacing weight and vehicle length algorithm ['M']
Other axle spacing algorithm ['H']
Human observation on site (manual) ['A']
Automated interpretation of vehicle image or signature (e.g. video microwave sonic) ['C']
Axle spacing with ASTM Standard E1572 ['E']
--
sample_type_for_vehicle_classification_name :
Station not used for Heavy Vehicle Travel Information System ['N']
Station used for Heavy Vehicle Travel Information System ['H']
nan []
--
method_of_data_retrieval_name :
Automated (telemetry) [2]
nan []
Not automated (manual) [1]
--
sample_type_for_truck_weight_name :
nan []
Station used for TMG sample and Strategic Highway Research Program (SHRP) Long Term Pavement Performance (LTPP) sample ['B']
Station not used for any of the above ['N']
Station used for TMG sample (but not SHRP/LTPP sample) ['T']
Station used for SHRP/LTPP sample (but not TMG sample) ['L']
--
sample_type_for_traffic_volume_name :
Station used for Traffic Volume Trends ['T']
nan []
Station not used for Traffic Volume Trends ['N']
--
method_of_truck_weighing_name :
nan []
Portable weigh-in-motion system [4]
Permanent weigh-in-motion system [5]
Portable static scale [1]
Chassis-mounted towed static scale [2]
--
lane_of_travel_name :
Other lanes [4 2 3 6 5 7 8 9]
Outside (rightmost) lane [1]
Data with lanes combined [0]
--
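Once validated, the "_name" columns can be label-encoded before modeling. A minimal sketch using pd.factorize; the values below are taken from the output above, but the encoding scheme itself is an illustrative choice, not the notebook's final one.

```python
import pandas as pd

# Toy frame with two of the *_name columns from the dataset.
df = pd.DataFrame({
    "direction_of_travel_name": ["West", "South", "North", "West"],
    "functional_classification_name": [
        "Urban: Principal Arterial - Interstate",
        "Rural: Minor Arterial",
        "Urban: Principal Arterial - Interstate",
        "Rural: Major Collector",
    ],
})

name_cols = ["direction_of_travel_name", "functional_classification_name"]

# Factorize each *_name column into integer codes, keeping the
# code-to-label mapping so the codes stay interpretable.
mappings = {}
for col in name_cols:
    codes, uniques = pd.factorize(df[col])
    df[col + "_code"] = codes
    mappings[col] = dict(enumerate(uniques))

print(df[["direction_of_travel_name", "direction_of_travel_name_code"]])
```

Keeping the mapping dictionaries around makes it easy to decode model inputs back into readable labels later.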
The numbers under fips_state_code correspond to US state names. External reference information from US government sites was retrieved to further interpret the FIPS data present in the dataset. We can check the states along with the number of entries under traffic_data.
We retrieve and process the following dataframes for reference:
row_dist = dict(traffic_data["fips_state_code"].value_counts())
fips_df, fips_loc_df = datautils.load_other_datasets(DATA_LOCATION,
filenames["FIPS_CODE"],
filenames["FIPS_LOC"])
Loading FIPS state codes reference from 'fips_code.csv' ...
Loading approximate FIPS coordinates reference from 'fips_latlong.csv' ...
Finished loading data.
fips_state_ref = datautils.create_fips_ref(fips_df)
state_dist_df = pd.DataFrame(row_dist.items(), columns=["fips_state_code",
"entries"])
state_dist_df["state_name"] = [fips_state_ref[fips] for
fips in row_dist.keys()]
state_dist_df.sort_values(by=["entries"], ascending=False, inplace=True)
state_dist_df
| fips_state_code | entries | state_name | |
|---|---|---|---|
| 0 | 12 | 606310 | florida |
| 1 | 51 | 488559 | virginia |
| 2 | 39 | 456486 | ohio |
| 3 | 13 | 355938 | georgia |
| 4 | 55 | 298473 | wisconsin |
| 5 | 53 | 266675 | washington |
| 6 | 16 | 252638 | idaho |
| 7 | 36 | 247120 | new york |
| 8 | 40 | 202750 | oklahoma |
| 9 | 28 | 194908 | mississippi |
| 10 | 29 | 191552 | missouri |
| 11 | 6 | 191057 | california |
| 12 | 1 | 183691 | alabama |
| 13 | 49 | 168085 | utah |
| 14 | 48 | 160638 | texas |
| 15 | 32 | 158472 | nevada |
| 16 | 26 | 156098 | michigan |
| 17 | 15 | 151394 | hawaii |
| 18 | 56 | 147387 | wyoming |
| 19 | 19 | 144872 | iowa |
| 20 | 4 | 133421 | arizona |
| 21 | 30 | 126627 | montana |
| 22 | 35 | 119045 | new mexico |
| 23 | 41 | 115959 | oregon |
| 24 | 10 | 111768 | delaware |
| 25 | 8 | 106887 | colorado |
| 26 | 44 | 102838 | rhode island |
| 27 | 20 | 98256 | kansas |
| 28 | 5 | 93872 | arkansas |
| 29 | 2 | 88234 | alaska |
| 30 | 21 | 85718 | kentucky |
| 31 | 42 | 81326 | pennsylvania |
| 32 | 27 | 74028 | minnesota |
| 33 | 33 | 67011 | new hampshire |
| 34 | 17 | 62715 | illinois |
| 35 | 24 | 61839 | maryland |
| 36 | 46 | 58861 | south dakota |
| 37 | 18 | 54365 | indiana |
| 38 | 54 | 51242 | west virginia |
| 39 | 47 | 47842 | tennessee |
| 40 | 34 | 47799 | new jersey |
| 41 | 23 | 44840 | maine |
| 42 | 37 | 44224 | north carolina |
| 43 | 45 | 42928 | south carolina |
| 44 | 22 | 42180 | louisiana |
| 45 | 31 | 35740 | nebraska |
| 46 | 25 | 33673 | massachusetts |
| 47 | 38 | 30383 | north dakota |
| 48 | 50 | 28028 | vermont |
| 49 | 9 | 21983 | connecticut |
| 50 | 11 | 3656 | district of columbia |
plt.hist(list(row_dist.values()), bins=10)
plt.show()
temporal_cols = ['date', 'day_of_data', 'day_of_week', 'month_of_data', 'year_of_data']
data_collection_cols = ['direction_of_travel_name',
'fips_state_code',
'functional_classification_name',
'lane_of_travel', 'record_type',
'restrictions', 'station_id']
traffic_vol_cols = [col for col in traffic_data.columns if 'traffic_volume' in col]
print("Null values for traffic volume columns in traffic_data:",
      traffic_data[traffic_vol_cols].isnull().sum().sum())
Null values for traffic volume columns in traffic_data: 0
print("Null values for temporal columns in traffic_data:",
      traffic_data[temporal_cols].isnull().sum().sum())
Null values for temporal columns in traffic_data: 0
print("Null values for data collection columns in traffic_data:")
print(traffic_data[data_collection_cols].isnull().sum())
Null values for data collection columns in traffic_data:
direction_of_travel_name                0
fips_state_code                         0
functional_classification_name          0
lane_of_travel                          0
record_type                             0
restrictions                      7140391
station_id                              0
dtype: int64
Since the restrictions column is entirely NaN and isn't a common column between traffic_data and traffic_stations, we can opt to drop/disregard it.
traffic_data["restrictions"].unique()
array([nan])
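Dropping the column is a one-liner; a sketch on a toy frame standing in for traffic_data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for traffic_data; restrictions is all-NaN as observed.
df = pd.DataFrame({"station_id": ["009024", "000302"],
                   "restrictions": [np.nan, np.nan]})

# Drop the uninformative column.
df = df.drop(columns=["restrictions"])
print(df.columns.tolist())  # → ['station_id']
```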
Check a sample state to see if its stations collect data for all the dates in the given year of 2015.
sub_state_df = traffic_data[traffic_data["fips_state_code"] == 9]
From this we can see that there are multiple entries per station in a given state, one per direction of travel. Since we are only interested in the collective traffic volume collected by a station at a specific location for a given time, we can later compress the traffic volume by taking the sum across directions.
sub_state_df = sub_state_df[sub_state_df["station_id"] == "009024"]
sub_state_df.sort_values(by="date")[["date",
"direction_of_travel_name",
"station_id"]]
| date | direction_of_travel_name | station_id | |
|---|---|---|---|
| 4472009 | 2015-01-01 | West | 009024 |
| 1290861 | 2015-01-01 | East | 009024 |
| 2646296 | 2015-01-02 | West | 009024 |
| 1269220 | 2015-01-02 | East | 009024 |
| 901436 | 2015-01-03 | East | 009024 |
| ... | ... | ... | ... |
| 3654540 | 2015-12-29 | East | 009024 |
| 5947767 | 2015-12-30 | East | 009024 |
| 2670560 | 2015-12-30 | West | 009024 |
| 4686220 | 2015-12-31 | East | 009024 |
| 805111 | 2015-12-31 | West | 009024 |
730 rows × 3 columns
For now, we can retrieve the unique dates regardless of direction of travel to find the dates on which each station in a given state collected data. This shows whether there are gaps in the data. Since a one-day gap loses 24 hourly data points, it would be hard to fill with a linear/average approximation due to the inherent seasonality of the data. However, filling the gaps would be feasible if the forecasting is done at a daily granularity.
We can then adjust the approach when transforming the data as model input.
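The gap check described above can be sketched on a toy series: build the full expected daily range and take its set difference with the dates actually observed. In the real data this would run per station_id within each state.

```python
import pandas as pd

# Dates a hypothetical station reported on, with Jan 3 missing.
observed = pd.to_datetime(["2015-01-01", "2015-01-02",
                           "2015-01-04", "2015-01-05"])

# Full expected daily range over the observed window.
full_range = pd.date_range(observed.min(), observed.max(), freq="D")

# Dates in the expected range that were never observed, i.e. the gaps.
missing = full_range.difference(observed)
print(missing)
```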
# Possible to supplement with entire list of FIPS state codes
# Displayed distribution for states with least amount of total entries
codes = state_dist_df[-6:]["fips_state_code"]
for fips_state_code in codes:
state_name = fips_state_ref[fips_state_code].upper()
sub_state_df = traffic_data[traffic_data["fips_state_code"] ==
fips_state_code]
num_data_points = []
stations = sub_state_df["station_id"].unique()
for station_id in stations:
entries = sub_state_df[sub_state_df["station_id"] ==
station_id]["date"].unique().shape[0]
num_data_points.append(entries)
plt.figure(figsize=(5,3))
plt.hist(num_data_points, bins = 10)
title = f"Dist. for dates present in data from stations in\n{state_name}"
plt.title(title)
plt.tight_layout()
plt.show()
Checking a portion of traffic_stations shows a lot of null values. Since we're mainly interested in analyzing traffic volume, data from traffic_stations is mostly used to supplement our findings and add features for modeling. By checking the spatial columns, we narrow down the features we want to explore for now and note the other columns for future exploration.
print("Null values for spatial columns in traffic_stations:")
null_count_trafficstations = traffic_stations[spatial_columns].isnull().sum()
print(null_count_trafficstations)
Null values for spatial columns in traffic_stations:
fips_state_code                   0
fips_county_code                  0
latitude                          1
longitude                         1
functional_classification_name    0
dtype: int64
Look for the row with the null longitude and latitude entry.
null_row = traffic_stations[traffic_stations[["longitude",
"latitude"]].isna().any(axis=1)]
null_row[["station_id", "station_location", "fips_state_code"]]
| station_id | station_location | fips_state_code | |
|---|---|---|---|
| 21473 | 000034 | Montgomery on Madison Ave. | 1 |
condition = (traffic_stations["station_id"] == "000034") & \
(traffic_stations["fips_state_code"] == 1)
traffic_stations[condition][["latitude", "longitude"]]
| latitude | longitude | |
|---|---|---|
| 4427 | 32.380392 | 86.286294 |
| 7488 | 32.380392 | 86.286294 |
| 13223 | 32.380392 | 86.286294 |
| 17168 | 32.380392 | 86.286294 |
| 19845 | 32.380392 | 86.286294 |
| 21473 | NaN | NaN |
Fill in the missing latitude and longitude using the values from the duplicate rows.
traffic_stations.loc[21473,"latitude"] = 32.380392
traffic_stations.loc[21473,"longitude"] = 86.286294
We assume that entries of 0 traffic volume throughout the dataset are normal, since certain areas may have lower traffic compared to urban or high-density locations.
traffic_data[traffic_vol_cols].describe()
| traffic_volume_counted_after_0000_to_0100 | traffic_volume_counted_after_0100_to_0200 | traffic_volume_counted_after_0200_to_0300 | traffic_volume_counted_after_0300_to_0400 | traffic_volume_counted_after_0400_to_0500 | traffic_volume_counted_after_0500_to_0600 | traffic_volume_counted_after_0600_to_0700 | traffic_volume_counted_after_0700_to_0800 | traffic_volume_counted_after_0800_to_0900 | traffic_volume_counted_after_0900_to_1000 | ... | traffic_volume_counted_after_1400_to_1500 | traffic_volume_counted_after_1500_to_1600 | traffic_volume_counted_after_1600_to_1700 | traffic_volume_counted_after_1700_to_1800 | traffic_volume_counted_after_1800_to_1900 | traffic_volume_counted_after_1900_to_2000 | traffic_volume_counted_after_2000_to_2100 | traffic_volume_counted_after_2100_to_2200 | traffic_volume_counted_after_2200_to_2300 | traffic_volume_counted_after_2300_to_2400 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7.140391e+06 | 7.140391e+06 | 7.140391e+06 | 7.140391e+06 | 7.140391e+06 | 7.140391e+06 | 7.140391e+06 | 7.140391e+06 | 7.140391e+06 | 7.140391e+06 | ... | 7.140391e+06 | 7.140391e+06 | 7.140391e+06 | 7.140391e+06 | 7.140391e+06 | 7.140391e+06 | 7.140391e+06 | 7.140391e+06 | 7.140391e+06 | 7.140391e+06 |
| mean | 1.145878e+02 | 7.874558e+01 | 6.622503e+01 | 7.016138e+01 | 1.171851e+02 | 2.454065e+02 | 4.334301e+02 | 5.833799e+02 | 5.774976e+02 | 5.600694e+02 | ... | 7.009825e+02 | 7.497160e+02 | 7.770437e+02 | 7.565536e+02 | 6.173322e+02 | 4.793756e+02 | 3.906426e+02 | 3.274747e+02 | 2.534447e+02 | 1.798298e+02 |
| std | 2.818492e+02 | 2.202875e+02 | 2.102642e+02 | 2.242483e+02 | 3.227085e+02 | 5.723301e+02 | 8.359078e+02 | 9.984941e+02 | 9.594217e+02 | 8.917308e+02 | ... | 1.092236e+03 | 1.143318e+03 | 1.173933e+03 | 1.172116e+03 | 1.061545e+03 | 9.203711e+02 | 8.290271e+02 | 7.989146e+02 | 7.284074e+02 | 6.901713e+02 |
| min | -1.000000e+00 | -1.000000e+00 | -1.000000e+00 | -1.000000e+00 | -1.000000e+00 | -1.000000e+00 | -1.000000e+00 | -1.000000e+00 | -1.000000e+00 | -1.000000e+00 | ... | -1.000000e+00 | -1.000000e+00 | -1.000000e+00 | -1.000000e+00 | -1.000000e+00 | -1.000000e+00 | -1.000000e+00 | -1.000000e+00 | -1.000000e+00 | -1.000000e+00 |
| 25% | 1.300000e+01 | 8.000000e+00 | 7.000000e+00 | 7.000000e+00 | 1.200000e+01 | 2.800000e+01 | 5.600000e+01 | 9.000000e+01 | 1.070000e+02 | 1.230000e+02 | ... | 1.700000e+02 | 1.830000e+02 | 1.860000e+02 | 1.740000e+02 | 1.310000e+02 | 9.500000e+01 | 7.200000e+01 | 5.400000e+01 | 3.600000e+01 | 2.200000e+01 |
| 50% | 4.200000e+01 | 2.700000e+01 | 2.100000e+01 | 2.300000e+01 | 3.800000e+01 | 8.600000e+01 | 1.700000e+02 | 2.640000e+02 | 2.850000e+02 | 3.030000e+02 | ... | 4.090000e+02 | 4.380000e+02 | 4.520000e+02 | 4.320000e+02 | 3.370000e+02 | 2.520000e+02 | 1.980000e+02 | 1.550000e+02 | 1.090000e+02 | 7.000000e+01 |
| 75% | 1.260000e+02 | 8.500000e+01 | 7.000000e+01 | 7.500000e+01 | 1.180000e+02 | 2.410000e+02 | 4.570000e+02 | 6.570000e+02 | 6.570000e+02 | 6.450000e+02 | ... | 8.220000e+02 | 8.910000e+02 | 9.340000e+02 | 9.070000e+02 | 7.220000e+02 | 5.510000e+02 | 4.470000e+02 | 3.670000e+02 | 2.780000e+02 | 1.930000e+02 |
| max | 9.999900e+04 | 8.074100e+04 | 9.001700e+04 | 9.001200e+04 | 7.056000e+04 | 7.815900e+04 | 9.002000e+04 | 9.018700e+04 | 9.999900e+04 | 9.530000e+04 | ... | 9.999900e+04 | 9.999900e+04 | 9.999900e+04 | 9.999900e+04 | 9.999900e+04 | 9.999900e+04 | 9.999900e+04 | 9.999900e+04 | 9.999900e+04 | 9.999900e+04 |
8 rows × 24 columns
At first glance, we can already see anomalous entries in the coordinate summary below: there are entries with 0 longitude and latitude, and even a ~990 entry for longitude. For reference, a quick search gives the coordinates of the US as approximately 37.0902° N, 95.7129° W.
traffic_stations[["latitude", "longitude"]].describe()
| latitude | longitude | |
|---|---|---|
| count | 28466.000000 | 28466.000000 |
| mean | 38.140872 | 100.190810 |
| std | 7.423779 | 77.899394 |
| min | 0.000000 | 0.000000 |
| 25% | 34.052574 | 80.576670 |
| 50% | 39.302096 | 88.209350 |
| 75% | 42.252370 | 111.937942 |
| max | 99.900010 | 993.387110 |
From the map below, we can see that some stations appear to be located outside of the US. We can also see that in some parts stations are sparse, which could also contribute to the lower daily traffic volume data collected there.
coors = traffic_stations[["latitude", "longitude"]].drop_duplicates()
# US Coordinates
gen_map = folium.Map(location=[37.0902, -95.7129],
zoom_start=3)
coors.apply(lambda row:folium.CircleMarker(location=
[row["latitude"],
row["longitude"]*-1],
radius=1, color="red"
).add_to(gen_map), axis=1)
display(gen_map)
# Create utility functions for easier reuse/plots
def visualize_stations(fips_state_code, lat=None, long=None, zoom=6,
                       long_neg=True, to_display=True, to_save=False,
                       save_dir=None):
    # Longitudes are stored as positive values in the dataset, so they
    # are negated by default to map into the western hemisphere.
    multiplier = -1 if long_neg else 1
    if lat is None or long is None:
        lat, long = get_latlong(fips_loc_df, fips_state_code)
    sub_df = traffic_stations[traffic_stations["fips_state_code"] ==
                              fips_state_code]
    m = folium.Map(location=[lat, long*multiplier], zoom_start=zoom)
    sub_df.apply(lambda row: folium.CircleMarker(
        location=[row["latitude"], row["longitude"]*multiplier],
        radius=2, color="red",
        popup=(row["station_id"],
               row["latitude"],
               row["longitude"])).add_to(m), axis=1)
    if to_display:
        display(m)
    if to_save:
        m.save(os.path.join(save_dir, f"{fips_state_code}.html"))

def get_latlong(latlong_df, fips_state_code):
    row = latlong_df[latlong_df["fips_code"] == fips_state_code]
    return row["latitude"].item(), row["longitude"].item()
From the plots below, we can see that some station entries are located outside of the US. There seems to be a common pattern where the coordinate is off by one decimal place, i.e. a factor of 10 (e.g. 987 should be 98.7 relative to the coordinates of the US).
visualize_stations(fips_state_code=46,
lat=44.067112,
long=987.602740,
zoom=4)
visualize_stations(fips_state_code=10,
lat=38.468950,
long=7.555783,
zoom=6)
Correct zero coordinates by retrieving an average longitude and latitude to fill in rows with no values.
def get_loc_mask(df, value=0):
    # Rows where either coordinate equals the placeholder value
    mask = (df["latitude"] == value) | \
           (df["longitude"] == value)
    return mask
traffic_stations[["latitude","longitude"]].isin([0]).sum()
latitude     357
longitude    355
dtype: int64
sub_cols = spatial_columns + ["station_location", "station_id"]
zero_df = traffic_stations[get_loc_mask(traffic_stations)][sub_cols]
for state in tqdm(zero_df["fips_state_code"].unique()):
# Assign longitude and latitude values of closest stations
# (same fips and county) to zero values
ts_sub_df = traffic_stations[(traffic_stations["fips_state_code"] == state)]
ts_sub_df = ts_sub_df[sub_cols]
zero_sub_df = ts_sub_df[get_loc_mask(ts_sub_df)]
avg_loc = fips_loc_df[fips_loc_df["fips_code"] == state]
for county in zero_sub_df["fips_county_code"].unique():
county_df = ts_sub_df[ts_sub_df["fips_county_code"] == county]
nonzero_count = (county_df.shape[0] - county_df[["latitude",
"longitude"]].isin([0]).sum().max())
if nonzero_count != 0:
avg_lat = county_df["latitude"].sum()/nonzero_count
avg_long = county_df["longitude"].sum()/nonzero_count
else:
avg_lat = avg_loc["latitude"].item()
avg_long = avg_loc["longitude"].item()
zero_rows = county_df[get_loc_mask(county_df)].index
# Correct zero coordinates
traffic_stations.loc[zero_rows,"latitude"] = avg_lat
traffic_stations.loc[zero_rows,"longitude"] = avg_long
traffic_stations[["latitude", "longitude"]].describe()
| latitude | longitude | |
|---|---|---|
| count | 28466.000000 | 28466.000000 |
| mean | 38.653558 | 101.208962 |
| std | 6.064388 | 77.126073 |
| min | 4.151080 | 6.832740 |
| 25% | 34.155322 | 80.758000 |
| 50% | 39.463549 | 88.321659 |
| 75% | 42.305418 | 111.949251 |
| max | 99.900010 | 993.387110 |
fips_loc_df.describe()
| fips_code | latitude | longitude | |
|---|---|---|---|
| count | 55.000000 | 55.000000 | 55.000000 |
| mean | 31.872727 | 37.801189 | 94.592671 |
| std | 18.589179 | 8.726122 | 22.755662 |
| min | 1.000000 | 13.444300 | 64.896300 |
| 25% | 17.500000 | 34.763850 | 77.925700 |
| 50% | 31.000000 | 39.045800 | 89.398500 |
| 75% | 45.500000 | 42.741600 | 105.826100 |
| max | 78.000000 | 64.200800 | 170.132200 |
Correct the misplaced decimal offset in the longitude and latitude entries.
condition = (traffic_stations["latitude"] < 20)
traffic_stations.loc[condition, "latitude"] *= 10
condition = (traffic_stations["latitude"] > 200)
traffic_stations.loc[condition, "latitude"] /= 10
condition = (traffic_stations["longitude"] < 20)
traffic_stations.loc[condition, "longitude"] *= 10
condition = (traffic_stations["longitude"] > 200)
traffic_stations.loc[condition, "longitude"] /= 10
visualize_stations(fips_state_code=10, zoom=7)
We can see here that while most of the stations are properly placed inside South Dakota, there are still some misplaced ones. We can click on the circles on the map to check the details of these stations; one of them, "000014", is located somewhere around Montreal. Since we are unsure of the proper offset for this set of values, we can later apply thresholding based on the coordinates in fips_loc_df to disregard entries from these stations, given how unreliable their data is.
fips_state_ref[46]
'south dakota'
visualize_stations(fips_state_code=46, zoom=4)
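The thresholding mentioned above could, for instance, flag stations whose coordinates fall too far from their state's approximate center in fips_loc_df. A sketch on toy values; the 5-degree tolerance and the coordinates below are illustrative assumptions, not values from the notebook.

```python
import pandas as pd

# Toy stations for one state: "000014" sits near Montreal-like
# coordinates, "000015" near the assumed state center.
stations = pd.DataFrame({
    "station_id": ["000014", "000015"],
    "latitude": [45.5, 44.1],
    "longitude": [73.6, 99.3],
})
state_lat, state_long = 44.3, 99.4   # assumed state center (illustrative)
TOL = 5.0                            # degrees; illustrative threshold

# Keep only stations within TOL degrees of the state center.
keep = ((stations["latitude"] - state_lat).abs() < TOL) & \
       ((stations["longitude"] - state_long).abs() < TOL)
print(stations[keep]["station_id"].tolist())
```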
sample_date = traffic_data["date"][0]
print(sample_date, type(sample_date))
2015-04-07 <class 'str'>
Convert date column to datetime.
traffic_data["date"] = pd.to_datetime(traffic_data["date"], format='%Y-%m-%d')
The scope of the dataset is the entirety of 2015, from January 1 until December 31, with hourly columns per daily entry. We can set aside the months of November and December as the test set if we are to do forecasting.
traffic_data["date"].min(), traffic_data["date"].max()
(Timestamp('2015-01-01 00:00:00'), Timestamp('2015-12-31 00:00:00'))
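Setting aside November and December as a test split could look like the sketch below, on a toy frame; the cutoff date follows the split described above.

```python
import pandas as pd

# Toy frame with a date column like traffic_data after conversion.
df = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-15", "2015-06-01",
                            "2015-11-03", "2015-12-25"]),
    "daily_total_traffic_volume": [100, 200, 150, 90],
})

# Everything from November 1 onward goes to the test set.
cutoff = pd.Timestamp("2015-11-01")
train = df[df["date"] < cutoff]
test = df[df["date"] >= cutoff]
print(len(train), len(test))  # → 2 2
```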
temporal_cols
['date', 'day_of_data', 'day_of_week', 'month_of_data', 'year_of_data']
Get the sum across the hourly traffic volume columns.
traffic_data["daily_total_traffic_volume"] = traffic_data[traffic_vol_cols].sum(axis=1)
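Note that the describe() output earlier showed a minimum of -1 in the hourly volume columns, which looks like a sentinel value. If one chooses to treat negative entries as missing (an assumption; the source does not specify their meaning), the daily sum could guard against them:

```python
import pandas as pd

# Two hourly columns for two toy rows; -1 stands in for the sentinel
# value seen in the describe() output (assumed, not confirmed).
df = pd.DataFrame({"h0": [10, -1], "h1": [5, 7]})
vol_cols = ["h0", "h1"]

# mask(...) turns negative entries into NaN; sum() then skips them.
daily_total = df[vol_cols].mask(df[vol_cols] < 0).sum(axis=1)
print(daily_total.tolist())  # → [15.0, 7.0]
```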
Check behavior across all states.
vol_aggr = traffic_data.groupby(by=["date"]).sum()
vol_aggr.sort_values(by=["date"], inplace=True)
vol_aggr[traffic_vol_cols]
| traffic_volume_counted_after_0000_to_0100 | traffic_volume_counted_after_0100_to_0200 | traffic_volume_counted_after_0200_to_0300 | traffic_volume_counted_after_0300_to_0400 | traffic_volume_counted_after_0400_to_0500 | traffic_volume_counted_after_0500_to_0600 | traffic_volume_counted_after_0600_to_0700 | traffic_volume_counted_after_0700_to_0800 | traffic_volume_counted_after_0800_to_0900 | traffic_volume_counted_after_0900_to_1000 | ... | traffic_volume_counted_after_1400_to_1500 | traffic_volume_counted_after_1500_to_1600 | traffic_volume_counted_after_1600_to_1700 | traffic_volume_counted_after_1700_to_1800 | traffic_volume_counted_after_1800_to_1900 | traffic_volume_counted_after_1900_to_2000 | traffic_volume_counted_after_2000_to_2100 | traffic_volume_counted_after_2100_to_2200 | traffic_volume_counted_after_2200_to_2300 | traffic_volume_counted_after_2300_to_2400 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| date | |||||||||||||||||||||
| 2015-01-01 | 3223576 | 3181872 | 2223664 | 1575072 | 1239528 | 1439760 | 2031303 | 2500735 | 3387267 | 4676215 | ... | 9983020 | 9882681 | 9545403 | 8820429 | 7736599 | 6381873 | 5312341 | 4539452 | 3919722 | 2886716 |
| 2015-01-02 | 1610246 | 1217518 | 1009195 | 1111719 | 1763394 | 3573933 | 6380159 | 8984543 | 9437511 | 9923855 | ... | 14523395 | 15008699 | 15064688 | 14455224 | 11640016 | 9024410 | 7458470 | 6374717 | 5340897 | 4030416 |
| 2015-01-03 | 2577595 | 1825768 | 1455633 | 1316807 | 1542949 | 2250677 | 3453902 | 4773896 | 6416513 | 8355406 | ... | 12400808 | 12417839 | 12085252 | 11208297 | 10069097 | 8174116 | 6714904 | 6010771 | 5220084 | 3976941 |
| 2015-01-04 | 2453194 | 1787680 | 1364075 | 1073599 | 1142533 | 1614966 | 2395067 | 3274791 | 4621344 | 6524587 | ... | 11763580 | 11585401 | 11068517 | 10103532 | 8735058 | 7076017 | 5986404 | 4740410 | 3696621 | 2625419 |
| 2015-01-05 | 1465199 | 1022522 | 898255 | 1104483 | 2230505 | 5124057 | 9420121 | 12786833 | 11690509 | 9921969 | ... | 12122289 | 13579024 | 14592689 | 14653304 | 10961416 | 7795879 | 5945410 | 4998290 | 3832536 | 3261052 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2015-12-27 | 2746432 | 1958019 | 1614357 | 1346980 | 1413870 | 1946702 | 2842151 | 3841742 | 5529217 | 8067908 | ... | 12820827 | 12582825 | 12413939 | 11728202 | 10367385 | 8574065 | 7025073 | 5687269 | 4352633 | 3056108 |
| 2015-12-28 | 2026581 | 1445499 | 1217811 | 1388296 | 2410563 | 5038825 | 8313180 | 10492829 | 10391300 | 10426716 | ... | 13575288 | 14256048 | 14503034 | 13686523 | 10364570 | 7636341 | 6044691 | 5120364 | 3993751 | 2812931 |
| 2015-12-29 | 1879321 | 1384483 | 1217424 | 1388979 | 2316712 | 4762246 | 7986529 | 10280653 | 10208252 | 10192247 | ... | 12548751 | 13738041 | 14644379 | 14114836 | 10928467 | 8191586 | 6512636 | 5610674 | 4426930 | 3102064 |
| 2015-12-30 | 2099643 | 1511422 | 1314564 | 1463601 | 2403055 | 4888793 | 8275262 | 10690800 | 10730065 | 10834819 | ... | 14270295 | 14988696 | 15311121 | 14648317 | 11666444 | 8884863 | 7246736 | 6238423 | 4824663 | 3463416 |
| 2015-12-31 | 2390806 | 1714802 | 1446086 | 1488881 | 2206567 | 4239963 | 7079830 | 9201786 | 9359581 | 9942337 | ... | 14696881 | 14737214 | 13812258 | 12342209 | 10390556 | 8222022 | 6531695 | 5472410 | 4470386 | 3159873 |
365 rows × 24 columns
plt.plot(vol_aggr["daily_total_traffic_volume"])
plt.title("Total daily traffic volume recorded across \
all states in the dataset")
plt.xlabel("Date")
plt.ylabel("Traffic Volume")
plt.show()
Check the date/entry with the sudden dip. Notably, March 8, 2015 was the start of daylight saving time in the US; the size of the dip, however, suggests that far fewer station entries were recorded on that date rather than just the lost hour.
vol_aggr[vol_aggr["daily_total_traffic_volume"] < 1e8]
| day_of_data | day_of_week | direction_of_travel | fips_state_code | lane_of_travel | month_of_data | record_type | restrictions | traffic_volume_counted_after_0000_to_0100 | traffic_volume_counted_after_0100_to_0200 | ... | traffic_volume_counted_after_1600_to_1700 | traffic_volume_counted_after_1700_to_1800 | traffic_volume_counted_after_1800_to_1900 | traffic_volume_counted_after_1900_to_2000 | traffic_volume_counted_after_2000_to_2100 | traffic_volume_counted_after_2100_to_2200 | traffic_volume_counted_after_2200_to_2300 | traffic_volume_counted_after_2300_to_2400 | year_of_data | daily_total_traffic_volume | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| date | |||||||||||||||||||||
| 2015-03-08 | 59488 | 7436 | 29772 | 232223 | 8045 | 22308 | 22308 | 0.0 | 1230385 | 780863 | ... | 5711736 | 5294911 | 4651611 | 3909009 | 3114447 | 2367735 | 1672892 | 1008443 | 111540 | 74339394 |
1 rows × 34 columns
Retrieve the hourly volume columns and flatten them into a 1D array.
hourly_vol_aggr = vol_aggr[traffic_vol_cols].to_numpy().ravel()
hourly_vol_aggr.shape
(8760,)
timestamps = []
hour_delta = [np.timedelta64(hour, 'h') for hour in range(0, 24)]
dates = list(vol_aggr.index)
for date in dates:
# Apply hour offset in date
timestamps += [date + hour for hour in hour_delta]
hourly_aggr_df = pd.DataFrame()
hourly_aggr_df["traffic_volume_aggr"] = hourly_vol_aggr
hourly_aggr_df.index = timestamps
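As an aside, the same hourly index can be built more compactly with pd.date_range, assuming the daily index is contiguous (a sketch on toy data, not a change to the notebook's approach):

```python
import numpy as np
import pandas as pd

# Toy stand-in for vol_aggr[traffic_vol_cols]: 2 days x 24 hourly columns
daily = pd.DataFrame(np.arange(48).reshape(2, 24),
                     index=pd.date_range("2015-01-01", periods=2, freq="D"))

# One timestamp per hour spanning the full daily range
idx = pd.date_range(daily.index[0], periods=len(daily) * 24,
                    freq=pd.Timedelta(hours=1))
hourly = pd.Series(daily.to_numpy().ravel(), index=idx,
                   name="traffic_volume_aggr")
```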
date_limit = "2015-02-01"
plt.plot(hourly_aggr_df[hourly_aggr_df.index < date_limit])
plt.title(f"Total traffic volume per hour recorded across all \
states in the dataset until {date_limit}")
plt.xlabel("Date")
plt.ylabel("Traffic Volume")
plt.show()
The animation code below is based on external example code.
def get_daily_volume(fips_state_code):
# To be used within the notebook.
# Will require other input parameters like traffic_data if used as util func
sub_df = traffic_data[(traffic_data["fips_state_code"] == fips_state_code)]
series = sub_df.groupby(by=["date"]).mean()
return series["daily_total_traffic_volume"]
y1 = get_daily_volume(4)
y2 = get_daily_volume(12)
y3 = get_daily_volume(6)
Uncomment the following block to generate the sample animation; it is based on an external example.
# plt.rcParams["figure.facecolor"] = "w"
# import numpy as np
# import matplotlib.pyplot as plt
# import matplotlib.animation as animation
# x = y3.index
# fig, ax = plt.subplots()
# line1, = ax.plot(x, y1,
# label=fips_state_ref[4].upper())
# line2, = ax.plot(x, y2,
# label=fips_state_ref[12].upper())
# line3, = ax.plot(x, y3,
# label=fips_state_ref[6].upper())
# def update(num, x, y1, y2, y3,
# line1, line2, line3):
# line1.set_data(x[:num], y1[:num])
# line2.set_data(x[:num], y2[:num])
# line3.set_data(x[:num], y3[:num])
# return [line1,line2,line3]
# ani = animation.FuncAnimation(fig, update, len(x), fargs=[x, y1, y2, y3,
# line1, line2, line3],
# interval=50, blit=True)
# ax.set_xlabel("Date")
# ax.set_ylabel("Average Traffic Volume across stations")
# plt.title("Sample aggregated traffic volume plots")
# plt.legend()
# ani.save(f"{DATA_LOCATION}/sample_traffic.gif")
# plt.show()
Hourly graphs for each day in January 2015 are superimposed below. We can see that there appear to be common per-hour trends across days.
month_start = pd.Timestamp("2015-01-01")  # Can be modified
for i in range(31):  # Cover all 31 days of the month
    lower_bound = month_start + pd.Timedelta(days=i)
    upper_bound = lower_bound + pd.Timedelta(days=1)
    plt.plot(hourly_aggr_df[(lower_bound <= hourly_aggr_df.index) &
                            (hourly_aggr_df.index < upper_bound)].values)
plt.title(f"Total traffic volume per hour recorded across all \
states in the dataset for January 2015 \n(Comparison of daily data)")
plt.xlabel("Hour")
plt.ylabel("Traffic Volume")
plt.show()
Traffic volume trends differ across parts of the day. From midnight to early morning (12-5 AM), traffic is low since most human activity, such as regular office hours and school, occurs between morning and afternoon. This is further supported by the spike in traffic volume around mid-morning (7-9 AM), when public transportation such as buses and taxis, along with private vehicles, is used to travel to schools, offices, business establishments, and elsewhere.
After stable traffic volumes in the early afternoon (11 AM-3 PM), there are sudden spikes as people most likely return home after their time outside, with volume slowly winding down into the night.
Because of this, we can add parts of day as a feature for our models when forecasting the data.
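A part-of-day feature could be derived by binning the hour, for example with pd.cut. The bin edges and labels below are our assumptions for illustration, roughly following the discussion above:

```python
import pandas as pd

def add_part_of_day(df, hour_col="hour"):
    """Bin the hour of day into coarse categories as a model feature."""
    bins = [0, 5, 9, 11, 15, 19, 24]          # assumed cut points
    labels = ["late_night", "morning_rush", "late_morning",
              "afternoon", "evening_rush", "night"]
    out = df.copy()
    out["part_of_day"] = pd.cut(out[hour_col], bins=bins,
                                labels=labels, right=False)
    return out

parts = add_part_of_day(pd.DataFrame({"hour": [3, 8, 17, 23]}))
```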
Since the day of week values in the dataset are encoded numerically, we can check the named equivalent for the day of week as shown below.
day_of_week_ref = dict()
for day in sorted(traffic_data["day_of_week"].unique()):
sample = traffic_data[traffic_data["day_of_week"] == day].iloc[0]
day_name = sample["date"].strftime('%A')
idx = sample["day_of_week"]
print("Date: ", sample["date"])
print("Day: ", day_name)
print("Day of week value: ", idx, end="\n\n")
day_of_week_ref[idx] = day_name
Date: 2015-04-26 00:00:00 Day: Sunday Day of week value: 1 Date: 2015-11-23 00:00:00 Day: Monday Day of week value: 2 Date: 2015-04-07 00:00:00 Day: Tuesday Day of week value: 3 Date: 2015-02-18 00:00:00 Day: Wednesday Day of week value: 4 Date: 2015-09-10 00:00:00 Day: Thursday Day of week value: 5 Date: 2015-06-26 00:00:00 Day: Friday Day of week value: 6 Date: 2015-09-26 00:00:00 Day: Saturday Day of week value: 7
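The same check can be done in a vectorized way with dt.day_name(). Note that the dataset's encoding is Sunday-first starting at 1, while pandas' dayofweek is Monday-first starting at 0; the conversion below is a sketch:

```python
import pandas as pd

# Sample dates from the output above (a Sunday, a Monday, a Tuesday)
dates = pd.Series(pd.to_datetime(["2015-04-26", "2015-11-23", "2015-04-07"]))
names = dates.dt.day_name()
# Reproduce the dataset's 1-7 Sunday-first codes from pandas' 0-6 Monday-first
codes = (dates.dt.dayofweek + 1) % 7 + 1
```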
dow_vol_aggr = traffic_data.groupby(by=["day_of_week"]).mean()
From the plot below, we can see that the traffic volume for weekends (Saturday and Sunday) is lower than for weekdays. Additionally, volume appears slightly higher on Fridays.
x = [day_of_week_ref[idx] for idx in dow_vol_aggr.index]
plt.plot(x, dow_vol_aggr["daily_total_traffic_volume"])
plt.xlabel("Day of Week")
plt.ylabel("Traffic Volume")
plt.title("Average traffic volume per day of week across all states")
plt.show()
We can further check the behavior of the traffic volume per hour for each day of the week. There is significantly lower traffic volume during early mornings on weekends. This may be attributed to activities such as school and regular office hours occurring only on weekdays, which lowers the weekend volume.
x = range(0, 24)
for idx in dow_vol_aggr.index:
plt.plot(x, dow_vol_aggr[traffic_vol_cols].loc[idx],
label=day_of_week_ref[idx])
plt.legend()
plt.xlabel("Hour")
plt.ylabel("Traffic Volume")
plt.show()
wd = dow_vol_aggr[(dow_vol_aggr.index < 7) & (1 < dow_vol_aggr.index)][traffic_vol_cols].mean()
we = dow_vol_aggr[(dow_vol_aggr.index == 7) | (1 == dow_vol_aggr.index)][traffic_vol_cols].mean()
plt.plot(x, wd, label="Weekday average")
plt.plot(x, we, label="Weekend average")
plt.xlabel("Hour")
plt.ylabel("Traffic Volume")
plt.legend()
plt.title("General comparison of average hourly volume for Weekdays and Weekends")
plt.show()
Use the Augmented Dickey-Fuller test to check the data for stationarity. If the p-value is low, we can opt not to transform the data further prior to modeling.
def adf_test(values):
adf = adfuller(values, autolag='AIC')
adf_output = pd.Series(adf[1:3], index=["p-value", "Number of lags used"])
adf_output["ADF Test Statistic"] = adf[0]
for percent, val in adf[4].items():
adf_output[f"Critical value {percent}"] = val
return adf_output
def get_hourly_volume(traffic_volume, date_entries):
hourly_vol = traffic_volume.to_numpy().ravel()
timestamps = []
hour_delta = [np.timedelta64(hour, 'h') for hour in range(0, 24)]
dates = list(date_entries)
for date in dates:
# Apply hour offset in date
timestamps += [date + hour for hour in hour_delta]
hourly_df = pd.DataFrame()
hourly_df["traffic_volume_aggr"] = hourly_vol
hourly_df.index = timestamps
return hourly_df
adf_test(hourly_aggr_df)
p-value 2.201661e-24 Number of lags used 3.700000e+01 ADF Test Statistic -1.304494e+01 Critical value 1% -3.431100e+00 Critical value 5% -2.861871e+00 Critical value 10% -2.566946e+00 dtype: float64
adf_test(vol_aggr["daily_total_traffic_volume"])
p-value 0.128263 Number of lags used 14.000000 ADF Test Statistic -2.449449 Critical value 1% -3.449173 Critical value 5% -2.869833 Critical value 10% -2.571188 dtype: float64
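Given the relatively high p-value for the daily series, a log transform followed by first differencing is one common pre-processing option before re-running adf_test. The sketch below uses synthetic data with a trend and weekly seasonality purely for illustration:

```python
import numpy as np
import pandas as pd

def log_diff(series):
    """Log-transform then first-difference a positive-valued series."""
    return np.log(series).diff().dropna()

# Synthetic daily series with a trend and weekly seasonality (illustrative)
rng = np.random.default_rng(0)
t = np.arange(365)
daily = pd.Series(1e8 + 2e5 * t
                  + 5e6 * np.sin(2 * np.pi * t / 7)
                  + rng.normal(0, 1e6, size=365))
stationary_candidate = log_diff(daily)
# adf_test(stationary_candidate) could then be re-run on the transformed series
```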
Modify the parameters below as needed for analysis.
fips_state_code = 12
station_id = "930198"
traffic_data[traffic_data["fips_state_code"] ==
fips_state_code]["station_id"].value_counts()
930198 8640
930174 8364
140190 5776
930101 5318
150302 5082
...
170361 489
109926 405
570385 112
290320 102
860377 76
Name: station_id, Length: 274, dtype: int64
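Since stations differ widely in coverage (from 8640 entries down to 76 above), one option is to restrict per-station analysis to stations with near-complete coverage. A sketch, with an assumed threshold:

```python
import pandas as pd

def well_covered_stations(df, station_col="station_id", min_entries=5000):
    """Return IDs of stations with at least min_entries rows."""
    counts = df[station_col].value_counts()
    return counts[counts >= min_entries].index.tolist()

# Toy example: station "A" is well covered, "B" is sparse
toy = pd.DataFrame({"station_id": ["A"] * 6000 + ["B"] * 100})
kept = well_covered_stations(toy)
```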
sub_df = traffic_data[(traffic_data["fips_state_code"] == fips_state_code) &
(traffic_data["station_id"] == station_id)]
sub_df_aggr = sub_df.groupby(by=["date"]).sum()
plt.plot(sub_df_aggr["daily_total_traffic_volume"])
title = f"Daily traffic volume collected by station {station_id} in "
title += fips_state_ref[fips_state_code].upper()
plt.title(title)
plt.show()
adf_test(sub_df_aggr["daily_total_traffic_volume"])
p-value 0.025923 Number of lags used 13.000000 ADF Test Statistic -3.108605 Critical value 1% -3.449173 Critical value 5% -2.869833 Critical value 10% -2.571188 dtype: float64
hourly_sub_aggr = get_hourly_volume(sub_df_aggr[traffic_vol_cols],
sub_df_aggr.index)
adf_test(hourly_sub_aggr)
p-value 2.498210e-25 Number of lags used 3.600000e+01 ADF Test Statistic -1.354243e+01 Critical value 1% -3.431102e+00 Critical value 5% -2.861872e+00 Critical value 10% -2.566947e+00 dtype: float64
The following plot shows the time series decomposition of the hourly traffic volume collected by the given station in the state, using a daily (24-hour) period. For hourly_sub_aggr, the entries were transformed to be hourly instead of daily.
Based on the default values in this notebook, hourly_sub_aggr was collected from station "930198" in the state of Florida (FIPS state code 12).
# Can be removed completely/set as 2016-01-01 if to be run for the whole year.
# Limit is placed to make the plots visible.
limit = "2015-03-01"
with mpl.rc_context():
mpl.rc("figure", figsize=(12,8))
volume = seasonal_decompose(hourly_sub_aggr[hourly_sub_aggr.index < limit],
model='additive',
period=24) # hours in the given dataframe
volume.plot()
plt.show()
The following shows the time series decomposition of the daily total traffic volume to check for trends on a weekly interval. Since vol_aggr has daily entries, the period is set to 7.
with mpl.rc_context():
mpl.rc("figure", figsize=(12,8))
volume = seasonal_decompose(vol_aggr["daily_total_traffic_volume"],
model='additive',
period=7)
volume.plot()
plt.show()
Create a widget for quickly checking station plots across states.
import folium
from IPython.display import HTML, display
import ipywidgets as widgets
dropdown_states = widgets.Dropdown(options = fips_df["state_name"].unique())
frame = widgets.Output()
long_neg=True
def dropdown_states_eventhandler(state):
frame.clear_output()
with frame:
fips_state_code = fips_df[fips_df["state_name"] ==
state.new]["fips_code"].item()
visualize_stations(fips_state_code, zoom=5)
dropdown_states.observe(dropdown_states_eventhandler, names="value")
display(dropdown_states)
display(frame)
state_aggr = traffic_data.groupby(by="fips_state_code")
vol_dict = dict(state_aggr["daily_total_traffic_volume"].mean())
state_vol_df = pd.DataFrame(vol_dict.items(),
columns=["fips_state_code", "avg_daily_traffic_volume"])
state_vol_df["state_name"] = [fips_state_ref[fips] for
fips in vol_dict.keys()]
state_vol_df.sort_values(by="avg_daily_traffic_volume", ascending=False)
| fips_state_code | avg_daily_traffic_volume | state_name | |
|---|---|---|---|
| 2 | 4 | 41975.350252 | arizona |
| 40 | 45 | 35202.631662 | south carolina |
| 4 | 6 | 34912.276425 | california |
| 20 | 24 | 24873.391630 | maryland |
| 39 | 44 | 23741.895243 | rhode island |
| 21 | 25 | 23640.604401 | massachusetts |
| 6 | 9 | 22781.689851 | connecticut |
| 8 | 11 | 20597.728118 | district of columbia |
| 46 | 51 | 18807.749916 | virginia |
| 43 | 48 | 17378.775613 | texas |
| 29 | 33 | 16553.653221 | new hampshire |
| 22 | 26 | 14792.572743 | michigan |
| 13 | 17 | 13274.660273 | illinois |
| 37 | 41 | 12444.945584 | oregon |
| 5 | 8 | 11652.683984 | colorado |
| 30 | 34 | 11495.154961 | new jersey |
| 23 | 27 | 11197.723415 | minnesota |
| 11 | 15 | 10601.726766 | hawaii |
| 33 | 37 | 10538.482679 | north carolina |
| 47 | 53 | 10144.434390 | washington |
| 10 | 13 | 9677.041353 | georgia |
| 35 | 39 | 9500.974396 | ohio |
| 27 | 31 | 9229.531058 | nebraska |
| 0 | 1 | 7956.350093 | alabama |
| 9 | 12 | 7938.545437 | florida |
| 25 | 29 | 7831.823536 | missouri |
| 44 | 49 | 7086.505518 | utah |
| 45 | 50 | 7043.689097 | vermont |
| 32 | 36 | 7028.233385 | new york |
| 17 | 21 | 6973.289729 | kentucky |
| 14 | 18 | 6800.458990 | indiana |
| 36 | 40 | 6779.391472 | oklahoma |
| 38 | 42 | 6567.119494 | pennsylvania |
| 28 | 32 | 6548.635374 | nevada |
| 49 | 55 | 6347.635749 | wisconsin |
| 18 | 22 | 6240.193409 | louisiana |
| 7 | 10 | 5912.847166 | delaware |
| 42 | 47 | 5784.233498 | tennessee |
| 24 | 28 | 4869.736799 | mississippi |
| 12 | 16 | 4591.650662 | idaho |
| 48 | 54 | 4507.809121 | west virginia |
| 3 | 5 | 4477.215506 | arkansas |
| 19 | 23 | 4369.733408 | maine |
| 15 | 19 | 4264.327758 | iowa |
| 1 | 2 | 3651.447390 | alaska |
| 16 | 20 | 3468.818230 | kansas |
| 31 | 35 | 3361.497400 | new mexico |
| 34 | 38 | 2643.543330 | north dakota |
| 41 | 46 | 2513.988838 | south dakota |
| 26 | 30 | 2042.695231 | montana |
| 50 | 56 | 1894.701995 | wyoming |
station_per_state = dict(state_aggr["station_id"].unique())
station_count = [len(station_per_state[state]) for state in station_per_state.keys()]
state_vol_df["station_count"] = station_count
There are many possible reasons for high daily traffic volume: population density of the state, the number and placement of stations collecting data, the number of roads or major highways providing interstate travel, and the presence of leisure establishments, stores, offices, and schools in an area, among others.
Check correlation between the number of stations in a given state and the average daily traffic volume.
state_vol_df[["station_count","avg_daily_traffic_volume"]].corr()
| station_count | avg_daily_traffic_volume | |
|---|---|---|
| station_count | 1.000000 | 0.273404 |
| avg_daily_traffic_volume | 0.273404 | 1.000000 |
sns.pairplot(state_vol_df[["station_count","avg_daily_traffic_volume"]])
plt.show()
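Since both station_count and avg_daily_traffic_volume are skewed, a rank-based (Spearman) correlation is a useful complement to the Pearson value above. The toy data below just illustrates the difference; the same .corr(method="spearman") call would apply to state_vol_df.

```python
import pandas as pd

# Toy data with a monotonic but strongly nonlinear relationship
toy = pd.DataFrame({"station_count": [10, 20, 40, 80],
                    "avg_daily_traffic_volume": [1e3, 2e3, 8e3, 6.4e4]})
pearson = toy.corr(method="pearson").iloc[0, 1]
spearman = toy.corr(method="spearman").iloc[0, 1]
# Spearman captures the monotonic link that Pearson understates
```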
# Modify as needed
for fips_state_code in [4, 45, 6]:
print(f"{fips_state_ref[fips_state_code].upper()}:")
comp_cols = ["station_id", "fips_state_code"]
loc_comp_df = traffic_data[traffic_data["fips_state_code"] ==
fips_state_code]
loc_comp_df = pd.merge(loc_comp_df,
traffic_stations[["station_id"] +
spatial_columns],
on=comp_cols)
loc_comp_df = loc_comp_df[["daily_total_traffic_volume",
"latitude","longitude"]].drop_duplicates()
print("Num of unique latitude: ", loc_comp_df["latitude"].unique().shape[0])
print("Num of unique longitude: ", loc_comp_df["longitude"].unique().shape[0])
# We're mostly interested in the correlation for the traffic volume
display(loc_comp_df.corr())
# Get pair plot to visualize correlation
sns.pairplot(loc_comp_df[["daily_total_traffic_volume",
"latitude","longitude"]])
plt.show()
ARIZONA: Num of unique latitude: 275 Num of unique longitude: 273
| daily_total_traffic_volume | latitude | longitude | |
|---|---|---|---|
| daily_total_traffic_volume | 1.000000 | -0.199628 | 0.122512 |
| latitude | -0.199628 | 1.000000 | 0.201931 |
| longitude | 0.122512 | 0.201931 | 1.000000 |
SOUTH CAROLINA: Num of unique latitude: 130 Num of unique longitude: 130
| daily_total_traffic_volume | latitude | longitude | |
|---|---|---|---|
| daily_total_traffic_volume | 1.000000 | 0.073801 | 0.11652 |
| latitude | 0.073801 | 1.000000 | 0.45220 |
| longitude | 0.116520 | 0.452200 | 1.00000 |
CALIFORNIA: Num of unique latitude: 281 Num of unique longitude: 280
| daily_total_traffic_volume | latitude | longitude | |
|---|---|---|---|
| daily_total_traffic_volume | 1.000000 | -0.396058 | -0.252037 |
| latitude | -0.396058 | 1.000000 | 0.890279 |
| longitude | -0.252037 | 0.890279 | 1.000000 |
Check highest daily traffic volume across the dataset.
daily_vol_stats = traffic_data["daily_total_traffic_volume"].describe()
daily_vol_stats
count 7.140391e+06 mean 1.059225e+04 std 1.673815e+04 min -1.400000e+01 25% 2.399000e+03 50% 5.900000e+03 75% 1.237100e+04 max 1.522379e+06 Name: daily_total_traffic_volume, dtype: float64
sub_cols = ["fips_state_code",
"functional_classification_name",
"station_id"]
traffic_data[traffic_data["daily_total_traffic_volume"] ==
daily_vol_stats["max"]][["date"] + sub_cols]
| date | fips_state_code | functional_classification_name | station_id | |
|---|---|---|---|---|
| 5643609 | 2015-05-31 | 36 | Urban: Collector | 003480 |
fips_state_ref[36]
'new york'
traffic_stations[(traffic_stations["fips_state_code"] == 36) &
(traffic_stations["station_id"] == "003480")][spatial_columns]
| fips_state_code | fips_county_code | latitude | longitude | functional_classification_name | |
|---|---|---|---|---|---|
| 8822 | 36 | 75 | 43.300598 | 76.152061 | Urban: Collector |
| 26555 | 36 | 75 | 43.300598 | 76.152061 | Urban: Collector |
Check the general trend for urban and rural areas.
ur_aggr = traffic_data.groupby(by="functional_classification_name")
ur_comparison = ur_aggr["daily_total_traffic_volume"].mean().sort_values()
We can see here that, compared to rural classifications, urban areas typically have higher average traffic volume across all states. Some rural classifications such as "Rural: Principal Arterial - Interstate" have relatively high average daily traffic volume, but are still much lower than their urban counterpart "Urban: Principal Arterial - Interstate".
plt.figure(figsize=(8,8))
plt.barh(ur_comparison.index, ur_comparison.values)
plt.xlabel("Average Daily Traffic Volume")
plt.show()
During our initial analysis, we saw that the data contains hourly traffic volume entries collected by stations in each US state for the entire year of 2015. Per state, there are counties and additional spatial information such as urban vs. rural classification, longitude, and latitude. There are gaps between daily entries, but no null values were found in the hourly entries. While there are zero and negative values, we assume that the sensors for each station are properly calibrated and leave those values untouched, as they may be intentional per station. We also assume that, even though stations may have different sensors, the traffic volume entries are normalized to the same unit across the dataset.
While checking the traffic stations, we saw entries with common patterns of incorrect longitude and latitude data. This was verified by matching the FIPS state codes to externally collected approximate coordinates per state. These values were corrected with an offset, and visualization confirmed that the corrected values are more appropriate for their state locations. However, some stations still have anomalous values that cannot be corrected with a simple multiplier, so it may be best to leave these stations out during visualization or forecasting due to their unreliable data. Other incorrect data points were corrected during analysis, and some columns were left out since they mostly contained station metadata and often had NaN values; these columns are noted for future exploration.
Afterwards, we checked the generalized temporal behavior of the data and found trends with respect to hour of day, part of day, and day of week. Statistical tests showed that the hourly traffic volume time series has a low ADF p-value, indicating stationarity and no need for further pre-processing. However, the p-value for the daily traffic volume series is relatively high, so forecasting daily volume may require pre-processing to adjust these values (e.g. via a log transform).
Lastly, spatial analysis was done by comparing traffic trends against the number of stations collecting data per state. Traffic volume per state and its possible correlation with longitude and latitude were also explored. We also confirmed that urban areas have higher daily traffic volume than rural areas, which may be used as a feature for modeling.
After our EDA, we can move on to feature engineering and data pre-processing to prepare our time series data for forecasting.